
Conversation

Meghagaur

Test PR, not intended for merge: a quick verification of the code-server build on Konflux; checks the Dockerfile and platform-specific pylock/pyproject changes.


openshift-ci bot commented Oct 8, 2025

Hi @Meghagaur. Thanks for your PR.

I'm waiting for a red-hat-data-services member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Meghagaur
Author

/build-konflux

@Nash-123

Nash-123 commented Oct 8, 2025

/ok-to-test

@Nash-123

Nash-123 commented Oct 8, 2025

/build-konflux


openshift-ci bot commented Oct 8, 2025

@Meghagaur: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/images | c50cb43 | link | true | /test images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Meghagaur
Author

/build-konflux

1 similar comment
@Meghagaur
Author

/build-konflux

@Meghagaur Meghagaur changed the base branch from main to rhoai-3.0 October 9, 2025 09:32
@Nash-123

Nash-123 commented Oct 9, 2025

/build-konflux

@Meghagaur Meghagaur force-pushed the s390x-codeserver-me branch from c50cb43 to f6091a7 Compare October 9, 2025 17:08

openshift-ci bot commented Oct 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign daniellutz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Nash-123

/build-konflux

1 similar comment
@Nash-123

/build-konflux

@Meghagaur
Author

/build-konflux

@jiridanek
Member

/build-konflux

jiridanek and others added 4 commits October 15, 2025 08:14
…rm64" (opendatahub-io#2574)

This reverts commit c8ff00a

Originally added in opendatahub-io#1396 because of
Pipenv limitations that are no longer present in uv.
…pipeline

The error message you provided, `(error: exit status 1; output: write /opt/app-root/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2: no space left on device)`, indicates a **disk space limitation** encountered during the container build process, specifically while writing an Nvidia-related Python package file.

This issue, commonly reported as `No space left on device`, generally occurs when the build pipeline attempts to write more data than is available in a shared volume or local ephemeral storage.

Here is a detailed analysis and potential solutions based on the source material, particularly those dealing with large container images and multi-platform builds:

### Root Cause and General Solution

1.  **Shared Volume Overflow:** The error likely means your build pipeline is consuming too much space in a shared volume, typically the workspace declared in your `PipelineRun` YAML. The default Tekton workspace size is often small (e.g., 1GB).
2.  **Solution: Increase Workspace Storage:** The standard recommendation is to **request more disk space** by increasing the storage value within the `.spec.workspaces` section of your relevant `PipelineRun` files:
    *   For example, one user solved a `prefetch-dependencies` task failure by increasing storage to 2Gi.
    *   However, for very large builds (like those involving AI/Nvidia libraries, as your error suggests), you may need significantly more space. One user noted that large images may require **2 to 3 times the actual file size** during building and tagging.
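A minimal sketch of that change, assuming a typical Tekton/Konflux `PipelineRun` layout; the file path, workspace name, and sizes here are illustrative, not taken from this repository:

```yaml
# .tekton/<component>-push.yaml (illustrative path)
spec:
  workspaces:
    - name: workspace
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi   # bumped from the 1Gi default; large AI images may need far more
```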

### Context specific to Large/AI/Multi-Arch Builds

Your specific error involving `/opt/app-root/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2` places this failure within the context of building large containers with extensive dependencies, as often seen in builds from the RHEL AI/AIPCC teams.

*   **Large Image Size:** Tasks involving machine learning or large model images (sometimes referred to as "modelcar" images) frequently face this issue because the artifacts being built are very large. For instance, certain AI models require ephemeral storage volumes that can exceed 200Gi.
*   **Aarch64/ARM64 Architecture:** In known instances of this exact type of error (involving unpacking Nvidia/vllm dependencies), the failure consistently occurred on **aarch64/arm64 builds**, while x86_64 builds passed. Default disk size for these nodes may be limited (e.g., 40 GB).
*   **Failure Point:** The failure you see is occurring during the process of **copying layers and metadata for the container**, likely during the unpacking or committing phase, pointing to the local ephemeral storage running out of space.

### Platform-Specific Workaround

Since the issue seems related to the underlying machine size, especially if you are targeting AArch64 (ARM64), a suggested fix is to utilize a remote platform with guaranteed larger disk space:

*   **Override Platform:** You can try replacing the default build platform in your `PipelineRun` configuration with a larger machine type. For **arm64** builds experiencing this failure, a recommendation was made to use `linux-d160-m2xlarge/arm64`, which provides **160 GB of disk space**.
*   **Configuring `buildah-remote`:** If you are using a remote build task (like `buildah-remote`), you would need to specify this larger platform flavor in the configuration.

If increasing the default workspace size is insufficient, addressing the underlying node size for large builds is crucial.
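The platform override could look like the following in the `PipelineRun` parameters. This is a sketch: the parameter name `build-platforms` and the flavor strings are assumptions based on the discussion above, so verify them against your actual pipeline definition:

```yaml
params:
  - name: build-platforms
    value:
      - linux/x86_64
      - linux-d160-m2xlarge/arm64   # 160 GB disk instead of the default arm64 node size
```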
…age components and new build-platform entries for specific components
…io#2568)

* Enabled TrustyAI Notebook for s390x

Signed-off-by: Nishan Acharya <[email protected]>

* Address Comments

Signed-off-by: Nishan Acharya <[email protected]>

* Removed EPEL from output stage

Signed-off-by: Nishan Acharya <[email protected]>

* Add s390x label for konflux

Signed-off-by: Nishan Acharya <[email protected]>

---------

Signed-off-by: Nishan Acharya <[email protected]>
atheo89 and others added 8 commits October 15, 2025 10:40
…io#2575)

* s390x changes for codeserver

* Update get_code_server_rpm.sh

Fix conditional syntax for the architecture guard.
The missing space before "$ARCH" turns the token into `||"$ARCH"`, so bash complains with "conditional binary operator expected" and exits before any build logic runs.

As suggested by coderabbitai.

* Update devel_env_setup.sh

Changed to a proper `&&` chain for the dnf commands.

* add files via lfs

---------
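The operator-spacing bug described in the `get_code_server_rpm.sh` fix can be reproduced in isolation. The function name and messages below are hypothetical, not the actual script contents; only the spacing rule is the point:

```shell
#!/usr/bin/env bash
set -euo pipefail

check_arch() {
  local arch="$1"
  # Broken form (note the missing space): [[ "$arch" != "x86_64" ||"$arch" != "s390x" ]]
  # bash lexes ||"$arch" as a single token and aborts with
  # "conditional binary operator expected" before any build logic runs.
  # Fixed form, with the operator properly space-separated:
  if [[ "$arch" == "x86_64" || "$arch" == "s390x" ]]; then
    echo "supported: $arch"
  else
    echo "unsupported: $arch"
  fi
}

check_arch "s390x"      # prints "supported: s390x"
check_arch "riscv64"    # prints "unsupported: riscv64"
```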

Co-authored-by: aryabjena <[email protected]>
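Why the `&&` chain for dnf matters can be shown with plain shell; the dnf commands themselves are elided here, with `false` standing in for a failing install step:

```shell
# With ';' (or separate lines), a failed command does not stop the chain,
# and the overall exit status reflects only the last command:
( false ; echo "still ran" )
# With '&&', the chain short-circuits and the failure propagates:
( false && echo "never runs" ) || echo "chain failed"
```

In a Dockerfile `RUN` step this is the difference between a broken `dnf install` being silently masked by a later `dnf clean all` and the build failing fast at the real error.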
Add BASE_IMAGE as a buildArg for the build configs and remove the aipcc bases, as we got unauthorized-access errors
Switch from aipcc to UBI/RHEL images for rstudio
@Meghagaur Meghagaur force-pushed the s390x-codeserver-me branch from a91156b to 9101b77 Compare October 15, 2025 11:02